Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps
Authors
Abstract
This paper proposes an automatic method for refining broadcast data collected every week so that it can be used efficiently for acoustic model training. To train acoustic models, we use only the audio signals, subtitle texts, and subtitle timestamps that accompany recorded broadcast programs. However, the subtitle timestamps are often inaccurate because of inherent characteristics of closed captioning. In the proposed method, we remove subtitle texts with a low subtitle quality index, concatenate adjacent subtitle texts into a merged subtitle text, and correct the timestamp of the merged subtitle text by adding a margin. A speech recognizer is then used to obtain a hypothesis text from the speech segment corresponding to the merged subtitle text. Finally, the refined speech segments used for acoustic model training are generated by selecting the subparts of the merged subtitle text that match the hypothesis text. Acoustic models trained on the refined broadcast data are shown to give significantly higher speech recognition accuracy than those trained on the raw broadcast data. In addition, the proposed method can refine a large amount of broadcast data with inaccurate timestamps in about half the time required by previous approaches.
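The steps described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the `Subtitle` class, the `quality` field (a stand-in for the subtitle quality index, whose definition the abstract does not give), and the `min_quality`, `max_gap`, `margin`, and `min_words` parameters are all hypothetical. The speech recognizer itself is left out, so the hypothesis text is passed in as a plain string, and word matching is done with Python's standard `difflib.SequenceMatcher`, which is only one possible way to find matching subparts.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Subtitle:
    start: float    # subtitle start time in seconds
    end: float      # subtitle end time in seconds
    text: str
    quality: float  # hypothetical stand-in for the subtitle quality index


def filter_and_merge(subs, min_quality=0.5, max_gap=1.0, margin=0.5):
    """Drop low-quality subtitles, merge adjacent ones, then widen timestamps.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    kept = [s for s in subs if s.quality >= min_quality]
    merged = []
    for s in kept:
        if merged and s.start - merged[-1].end <= max_gap:
            # Close enough in time: concatenate into the previous subtitle.
            last = merged[-1]
            merged[-1] = Subtitle(
                last.start,
                max(last.end, s.end),
                last.text + " " + s.text,
                min(last.quality, s.quality),
            )
        else:
            merged.append(Subtitle(s.start, s.end, s.text, s.quality))
    # Add a margin on both sides to absorb inaccurate timestamps.
    return [
        Subtitle(max(0.0, m.start - margin), m.end + margin, m.text, m.quality)
        for m in merged
    ]


def matching_subparts(subtitle_text, hypothesis_text, min_words=3):
    """Return word runs of the subtitle text that also occur in the hypothesis."""
    ref = subtitle_text.split()
    hyp = hypothesis_text.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        " ".join(ref[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words  # skip very short matches and the end marker
    ]


subs = [
    Subtitle(10.0, 12.0, "good evening and welcome", 0.9),
    Subtitle(12.2, 14.0, "to the nine o'clock news", 0.8),
    Subtitle(14.1, 15.0, "zzz", 0.1),
    Subtitle(30.0, 32.0, "in other news today", 0.9),
]
merged = filter_and_merge(subs)
# In practice the hypothesis would come from a speech recognizer run on the
# audio between merged[0].start and merged[0].end.
hypothesis = "good evening and welcome to uh nine o'clock news"
parts = matching_subparts(merged[0].text, hypothesis)
```

In this toy input, the low-quality third subtitle is dropped, the first two are merged into one segment with widened timestamps, and only the word runs on which the subtitle and the hypothesis agree are kept as candidate training text.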
Similar resources
Improved ROVER using Language Model Information
In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Usually speech recognizers are evaluated by measuring the word error so that there is a mismatch between the training and the evaluation criterion. Recently, algorithms for minimizing directly the word error and...
Combining multiple speech recognizers using voting and language model information
In 1997, NIST introduced a voting scheme called ROVER for combining word scripts produced by different speech recognizers. This approach has achieved a relative word error reduction of up to 20% when used to combine the systems’ outputs from the 1998 and 1999 Broadcast News evaluations. Recently, there has been increasing interest in using this technique. This paper provides an analysis of seve...
Real-time speech-generated subtitles: problems and solutions
This paper refers to work carried out in the Subspeak project [1] in which we are investigating the use of speech recognition in live television subtitling. Research to date has shown that with current speech recognition technology it is not possible to achieve a satisfactory level of accuracy in the direct transcription of broadcast material. To circumvent this problem in our system the broadc...
Combining forward-based and backward-based decoders for improved speech recognition performance
Combining outputs of speech recognizers is a known way of increasing speech recognition performance. The ROVER approach handles efficiently such combinations. In this paper we show that the best performance is not achieved by combining the outputs of the best set of recognizers, but rather by combining outputs of recognizers that rely on different processing components, and in particular on a d...
Lightly supervised and unsupervised acoustic model training
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consu...
Journal title:
Volume, issue
Pages -
Publication date: 2017